-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup Benchmarker.prepare
(compute_connectivities_umap
)
#128
Speedup Benchmarker.prepare
(compute_connectivities_umap
)
#128
Conversation
I didn't touch the neighbors code. Maybe I merged a PR but that's about it. I'd rather lobby for a new scanpy release and bring any improvements upstream into scanpy. But cool work @jan-engelmann ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I think this should be added to the package so we don't rely on a private fn of scanpy. Also the fn is quite simple.
knn_indices, | ||
knn_dists, | ||
n_obs, | ||
n_neighbors, | ||
set_op_mix_ratio=1.0, | ||
local_connectivity=1.0, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add typing
set_op_mix_ratio=1.0, | ||
local_connectivity=1.0, | ||
): | ||
"""Sped up version of sc.neighbors._compute_connectivities_umap.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put a more general docstring? Overview of the method and that it matches how connectivies are computed in scanpy?
X, | ||
n_neighbors, | ||
None, | ||
None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I prefer using keywords everywhere
assert (new_dist == sc_dist).todense().all() | ||
assert (new_connect == sc_connect).todense().all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use np.testing
assert (new_connect == sc_connect).todense().all() | ||
|
||
|
||
def test_timing_compute_connectivities_umap(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this test is necessary if you add a reproducible script to this PR description
if isinstance(connectivities, tuple): | ||
# In umap-learn 0.4, this returns (result, sigmas, rhos) | ||
connectivities = connectivities[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this bit still necessary? What's the lower bound for scanpy? If we merge this PR we will need to make umap a direct dependency, so please also add that and potentially remove this block.
Actually I would hold off on this PR. I see some redundancies in converting back and forth from sparse distance matrices that I'd like to address first. This should eliminate the need for building the sparse distance graph here |
@jan-engelmann feel free to give a review to #129 |
Closed due to #129 @jan-engelmann I will add you as an author on the PR |
thanks @Zethson :) |
Thanks @adamgayoso I appreciate it! I like the refactoring. Very clean! 👍 |
Also thanks for the review! @adamgayoso |
Thanks for the great package!
Unfortunately
Benchmarker.prepare()
is very slow due to currently released scanpy code:sc.neighbors._compute_connectivities_umap
is highly inefficient (see this monster).One iteration of the KNN calculation takes 2:30h for 7 Million cells. A large part of this time is spent in
sc.neighbors._compute_connectivities_umap
.I see two solutions:
A Use the approach proposed in this PR
B Use new scanpy code not yet released but already on main
A This PR
The solution in this PR offers the following speedup:
B Unreleased Scanpy code
The Neighbors scanpy code on main has been thoroughly refactored by @Zethson and likely offers a way to do this efficiently. For example
scanpy.neighbors._common._get_sparse_matrix_from_indices_distances
looks promising. I can look more into this if there's interest.What do you think @Zethson @adamgayoso